Bicycling yields many benefits: riders improve their health through exercise, while traffic congestion is reduced if riders move out of cars, with a corresponding reduction in pollution from carbon emissions. In recent years, bike sharing has become popular in a growing list of cities around the world. The NYC “CitiBike” bicycle sharing scheme went live (in midtown and downtown Manhattan) in 2013 and has been expanding ever since, as measured both by daily ridership and by its geographic footprint, which incorporates a growing number of “docking stations” as the system welcomes riders in Brooklyn, Queens, and northern parts of Manhattan which were not previously served.
One problem that many bikeshare systems face is money. An increase in the number of riders who want to use the system necessitates that more bikes be purchased and put into service in order to accommodate them. Heavy ridership induces wear on the bikes, requiring more frequent repairs. However, an increase in the number of trips does not necessarily translate to an increase in revenue, because clever riders can avoid paying surcharges by keeping the length of each trip below a specified limit (either 30 or 45 minutes, depending on user category).
We seek to examine CitiBike ridership data, joined with daily NYC weather data, to study the impact of weather on shared bike usage and generate a predictive model which can estimate the number of trips that would be taken on each day.
The goal is to estimate future demand which would enable the system operator to make expansion plans.
Our finding is that ridership exhibits strong seasonality, with correlation to weather-related variables such as daily temperature and precipitation. Additionally, ridership is segmented by user_type (annual subscribers use the system much more heavily than casual users), gender (there are many more male users than female) and age (a large number of users are clustered in their late 30s).
Bikeshare, Weather, Cycling, CitiBike, New York City
Since 2013 a shared bicycle system known as CitiBike has been available in New York City. The benefits to having such a system include reducing New Yorkers’ dependence on automobiles and encouraging public health through the exercise attained by cycling. Additionally, users who would otherwise spend money on public transit may find bicycling more economical – so long as they are aware of CitiBike’s pricing constraints.
There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. The system has been expanding each year, with increases in the number of bicycles available and expansion of the geographic footprint of docking stations. For planning purposes, the system operator needs to project future ridership in order to make good investments.
The available usage data provides a wealth of information which can be mined to seek trends in usage. With such intelligence, the company would be better positioned to determine what actions might optimize its revenue stream.
Because of weather, ridership is expected to be lower during the winter months, and on foul-weather days during the rest of the year, than on a warm and sunny summer day. Using the weather data we can seek to model the relationship between bicycle ridership and fair/foul or hot/cold weather.
What are the differences in rental patterns between annual members (presumably, local residents) and casual users (presumably, tourists)?
Is there any significant relationship between the age and/or gender of the bicycle renter vs. the rental patterns?
The rest of the paper proceeds as follows:
Westland et al. examined consumer behavior in bike sharing in Beijing using a deep-learning model incorporating weather and air quality, time-series of demand, and geographical location; later adding customer segmentation. [@Westland_Mou_Yin_2019]
Jia et al. performed a retrospective study of dockless bike sharing in Shanghai to determine whether the introduction of such a program increased cycling. Their methodology was to survey people in various neighborhoods, in which the areas were selected by sampling and the respondents were recruited through street interviews. [@Jia_Ding_Gebel_Chen_Zhang_Ma_Fu_2019]
Jia and Fu further examined whether dockless bicycle-sharing programs promote changes in travel mode in commuting and non-commuting trips, as well as the association between change in travel mode and potential correlates, as part of the same Shanghai study. [@Jia_Fu_2019]
Dell’Amico et al. modeled bike sharing rebalancing programs initially in Reggio Emilia, Italy using branch-and-cut algorithms. [@DellAmico_Hadjicostantinou_Iori_Novellani_2014]
In a more recent paper, Dell’Amico et al. examined the bike-sharing rebalancing problem with Stochastic Demands, aimed at determining minimum cost routes for a fleet of homogeneous vehicles in order to redistribute bikes among stations. [@DellAmico_Iori_Novellani_Subramanian_2018]
Zhou analyzed massive bike-sharing data in Chicago, constructing a bike flow similarity graph and using a fast-greedy algorithm to detect spatial communities of biking flows. He examined the questions 1. How do bike flow patterns vary as a result of time, weekday or weekend, and user groups? 2. Given the flow patterns, what was the spatiotemporal distribution of the over-demand for bikes and docks in 2013 and 2014? [@Zhou_2015]
Hosford et al. surveyed participants in Vancouver, Canada and determined that public bicycle share programs are not used equally by all segments of the population. In many cities, program members tend to be male, Caucasian, employed, and have higher educations and incomes compared to the general population. Further, their study determined that the majority of bicycle share trips replace trips previously made by walking or public transit, indicating that bicycle share appeals to people who already use active and sustainable modes of transportation. [@Hosford_Lear_Fuller_Teschke_Therrien_Winters_2018]
In another paper, Hosford et al. determined that the implementation of the public bicycle share program in Vancouver was associated with greater increases in bicycling for those living and working inside the bicycle share service area relative to those outside the service area in the early phase of implementation, but that this effect was not sustained over time. [@Hosford_Fuller_Lear_Teschke_Gauvin_Brauer_Winters_2018]
Schmidt observed that the number of bike-sharing programs worldwide grew from 5 in 2005 to 1,571 in 2018. He further noted that disparities in bike-sharing usage are evident around the country, with users skewing towards younger white men. [@Schmidt_2018]
Wang et al. examined the rebalancing problem and determined that the fluctuation of the available bikes and docks is not only caused by the user but also by the operators’ own (inefficient) rebalancing activities; they propose a data-driven model to generate an optimal rebalancing model while minimizing the cost of moving the bikes. [@Wang_He_Zhang_Shu_Liu_Gu_Liu_Lee_Son_2018]
Vogel and Mattfeld observe that short rental times and one-way use lead to imbalances in the spatial distribution of bikes at stations over time. They present a case study demonstrating that data mining applied to operational data offers insight into typical usage patterns of bike-sharing systems and can be used to forecast bike demand, with the aim of supporting and improving strategic and operational planning. They analyze both operational data from Vienna’s shared bike rental system and local weather data over the same period. [@Vogel_Mattfeld_2011]
Fuller et al. examined the impact of a public transit strike (November 2016 in Philadelphia) on usage of the bike share service in that city. [@Fuller_Luan_Buote_Auchincloss_2019]
In an earlier study, Fuller et al. examined bikeshare in Montreal by collecting samples prior to the launch of the program, and following each of the first two seasons. [Unlike other cities such as New York, the Montreal bike share system does not operate year-round. Rather, because of the especially harsh winters, their bikeshare system is dismantled each fall and reinstalled each spring.] Fuller’s methodology incorporated a 5-step logistic regression in which the weather variables entered at step 4; this rendered nonsignificant the differences between the three survey periods. [@Fuller_Gauvin_Kestens_Daniel_Fournier_Morency_Drouin_2013]
Faghih-Imani and Eluru study the decision process involved in identifying destination locations after picking up the bicycle at a BSS station. In the traditional destination/location choice approaches, the model frameworks implicitly assume that the influence of exogenous factors on the destination preferences is constant across the entire population. They propose a finite mixture multinomial logit (FMMNL) model that accommodates such heterogeneity by probabilistically assigning trips to different segments and estimating segment-specific destination choice models for each segment. Unlike the traditional destination-choice-based multinomial logit (MNL) model or mixed multinomial logit (MMNL), in an FMMNL model, we can consider the effect of fixed attributes across destinations such as users’ or origins’ attributes in the decision process. [@Faghih-Imani_Eluru_2018]
An et al. examine weather and cycling in New York City and find that weather impacts cycling rates more than topography, infrastructure, land use mix, calendar events, and peaks. They do so by exploring a series of interaction effects, each of which captures the extent to which two characteristics occurring simultaneously exert a combinatorial effect on cycling ridership; e.g., how is cycling impacted when it is both wet and a weekend day, or on a humid day in the hilliest parts of the cycling network? [@An_Zahnow_Pojani_Corcoran_2019]
Heaney et al. examine the relation between ambient temperature and bikeshare usage, and project how climate change-induced increases in ambient temperature may influence active transportation in New York City. [@Heaney_Carrión_Burkart_Lesk_Jack_2019]
In the 1990s, Nankervis examined the effect of weather and climate on university student bicycle commuting patterns in Melbourne, Australia by examining counts of parked bicycles at local universities and correlating them with the weather for each day, finding that the deterrent effect of bad weather on commuting was less than commonly believed (though still significant). [@Nankervis_1999]
We obtained data from two sources:
CitiBike makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.
For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset. For example, if a user checks out a bike from a dock but then returns it within one minute, the system drops such a “trip” from the listing, as such “trips” are not interesting.)
There are currently 77 monthly data files for the New York City bikeshare system, spanning July 2013 through November 2019. Each file contains a line for every trip. The number of trips per month varies from as few as 200,000 during winter months in the system’s early days to more than 2 million trips this past summer. The total number of entries was more than 90 million, resulting in 17GB of data. Because of the computational limitations which this presented, we created samples of 1/1000 and 1/100 of the data. The samples were created deterministically, by subsetting the files on each 1000th (or, 100th) row.
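The deterministic sampling described above can be sketched as follows. This is a minimal illustration on toy data, not the authors' actual script: the helper name `sample_every_kth` is hypothetical, and starting the selection at row k (rather than row 1) is an assumption.

```r
# Hypothetical helper: keep every k-th row of a data frame.
# No random number generator is involved, so the result is reproducible.
sample_every_kth <- function(df, k) {
  stopifnot(k >= 1)
  keep <- seq_len(nrow(df)) %% k == 0   # TRUE for rows k, 2k, 3k, ...
  df[keep, , drop = FALSE]
}

# Toy stand-in for one monthly trip file (2,500 trips).
trips <- data.frame(trip_id = 1:2500, trip_duration = rep(600, 2500))

sample_1000 <- sample_every_kth(trips, 1000)   # 1/1000 sample
sample_100  <- sample_every_kth(trips, 100)    # 1/100 sample
```

Because the rows are chosen by position, rerunning the script on the same files always yields the same sample.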
We also obtained historical weather information for 2013-2019 from the NCDC (National Climatic Data Center) by submitting an online request to https://www.ncdc.noaa.gov/cdo-web/search . Although the weather may vary slightly within New York City, we opted to use just the data associated with the Central Park observations as a proxy for the entire city’s weather.
We believe that the above data provides a reasonable representation of the target population (all CitiBike rides) and the citywide weather.
load(file='DATA/CB.RData')
city_bike_df = as.data.frame(CB)
head(city_bike_df)
## trip_duration s_time e_time s_station_id s_station_name s_lat s_long
## 1 634 2013-07-01 00:00:00 2013-07-01 00:10:34 164 E 47 St & 2 Ave 40.75323 -73.97033
## 2 437 2013-07-01 06:54:02 2013-07-01 07:01:19 479 9 Ave & W 45 St 40.76019 -73.99126
## 3 1398 2013-07-01 08:03:38 2013-07-01 08:26:56 157 Henry St & Atlantic Ave 40.69089 -73.99612
## 4 1124 2013-07-01 08:37:40 2013-07-01 08:56:24 496 E 16 St & 5 Ave 40.73726 -73.99239
## 5 1199 2013-07-01 09:16:59 2013-07-01 09:36:58 432 E 7 St & Avenue A 40.72622 -73.98380
## 6 221 2013-07-01 11:50:21 2013-07-01 11:54:02 475 E 16 St & Irving Pl 40.73524 -73.98759
## e_station_id e_station_name e_lat e_long bike_id user_type birth_year gender
## 1 504 1 Ave & E 15 St 40.73222 -73.98166 16950 Customer NA 0
## 2 243 Fulton St & Rockwell Pl 40.68798 -73.97847 16151 Subscriber 1987 1
## 3 375 Mercer St & Bleecker St 40.72679 -73.99695 15997 Subscriber 1987 1
## 4 500 Broadway & W 51 St 40.76229 -73.98336 17750 Subscriber 1959 2
## 5 466 W 25 St & 6 Ave 40.74395 -73.99145 17671 Subscriber 1983 2
## 6 537 Lexington Ave & E 24 St 40.74026 -73.98409 16490 Subscriber 1956 1
nrow(city_bike_df)
## [1] 92565
ncol(city_bike_df)
## [1] 15
# Weather data is obtained from the NCDC (National Climatic Data Center) via https://www.ncdc.noaa.gov/cdo-web/
# click on search tool https://www.ncdc.noaa.gov/cdo-web/search
# select "daily summaries"
# select Search for Stations
# Enter Search Term "USW00094728" for Central Park Station:
# https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail
# "add to cart"
weatherfilenames <- list.files(path = "./", pattern = "\\.csv$", full.names = TRUE) # files ending with .csv ; not .zip
#weatherfilenames
weatherfile <- "DATA/NYC_Weather_Data_2013-2019.csv"
## Perhaps we should rename the columns to more clearly reflect their meaning?
weatherspec <- cols(
STATION = col_character(),
NAME = col_character(),
LATITUDE = col_double(),
LONGITUDE = col_double(),
ELEVATION = col_double(),
DATE = col_date(format = "%F"), # readr::parse_datetime() : "%F" = "%Y-%m-%d"
#DATE = col_date(format = "%m/%d/%Y"), #col_date(format = "%F")
AWND = col_double(), # Average Daily Wind Speed
AWND_ATTRIBUTES = col_character(),
PGTM = col_double(), # Peak Wind-Gust Time
PGTM_ATTRIBUTES = col_character(),
PRCP = col_double(), # Amount of Precipitation
PRCP_ATTRIBUTES = col_character(),
SNOW = col_double(), # Amount of Snowfall
SNOW_ATTRIBUTES = col_character(),
SNWD = col_double(), # Depth of snow on the ground
SNWD_ATTRIBUTES = col_character(),
TAVG = col_double(), # Average Temperature (not populated)
TAVG_ATTRIBUTES = col_character(),
TMAX = col_double(), # Maximum temperature for the day
TMAX_ATTRIBUTES = col_character(),
TMIN = col_double(), # Minimum temperature for the day
TMIN_ATTRIBUTES = col_character(),
TSUN = col_double(), # Daily Total Sunshine (not populated)
TSUN_ATTRIBUTES = col_character(),
WDF2 = col_double(), # Direction of fastest 2-minute wind
WDF2_ATTRIBUTES = col_character(),
WDF5 = col_double(), # Direction of fastest 5-second wind
WDF5_ATTRIBUTES = col_character(),
WSF2 = col_double(), # Fastest 2-minute wind speed
WSF2_ATTRIBUTES = col_character(),
WSF5 = col_double(), # fastest 5-second wind speed
WSF5_ATTRIBUTES = col_character(),
WT01 = col_double(), # Fog
WT01_ATTRIBUTES = col_character(),
WT02 = col_double(), # Heavy Fog
WT02_ATTRIBUTES = col_character(),
WT03 = col_double(), # Thunder
WT03_ATTRIBUTES = col_character(),
WT04 = col_double(), # Sleet
WT04_ATTRIBUTES = col_character(),
WT06 = col_double(), # Glaze
WT06_ATTRIBUTES = col_character(),
WT08 = col_double(), # Smoke or haze
WT08_ATTRIBUTES = col_character(),
WT13 = col_double(), # Mist
WT13_ATTRIBUTES = col_character(),
WT14 = col_double(), # Drizzle
WT14_ATTRIBUTES = col_character(),
WT16 = col_double(), # Rain
WT16_ATTRIBUTES = col_character(),
WT18 = col_double(), # Snow
WT18_ATTRIBUTES = col_character(),
WT19 = col_double(), # Unknown source of precipitation
WT19_ATTRIBUTES = col_character(),
WT22 = col_double(), # Ice fog
WT22_ATTRIBUTES = col_character()
)
# load all the daily weather data
weather <- read_csv(weatherfile, col_types = weatherspec)
weather_df1 = as.data.frame(weather)
# Check the number of rows and columns in weather data frame
nrow(weather_df1)
## [1] 2541
ncol(weather_df1)
## [1] 56
# Select only those columns that are useful for our analysis
weather_df = select(weather_df1, STATION, NAME, DATE, AWND, PRCP, SNOW, SNWD, TMAX, TMIN, WDF2, WDF5, WSF2, WSF5, WT01)
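# Count the missing (NA) values in each column
# NOTE: DATE is NA in all 2541 rows below, i.e. the "%F" date format in weatherspec did not match the file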
sapply(weather_df, function(x) sum(is.na(x)))
## STATION NAME DATE AWND PRCP SNOW SNWD TMAX TMIN WDF2 WDF5 WSF2 WSF5 WT01
## 0 0 2541 167 0 1 0 0 0 164 180 164 180 1696
# Perform data imputation on weather_df: replace missing values with the column mean
for (col in c("AWND", "SNOW", "WDF2", "WDF5", "WSF2", "WSF5")) {
  weather_df[[col]][is.na(weather_df[[col]])] <- mean(weather_df1[[col]], na.rm = TRUE)
}
# Re-count the missing values to confirm that the imputation filled the gaps
sapply(weather_df, function(x) sum(is.na(x)))
## STATION NAME DATE AWND PRCP SNOW SNWD TMAX TMIN WDF2 WDF5 WSF2 WSF5 WT01
## 0 0 2541 0 0 0 0 0 0 0 0 0 0 1696
# City bike data number of rows and columns
c(nrow(city_bike_df), ncol(city_bike_df))
## [1] 92565 15
# Weather data number of rows and columns
c(nrow(weather_df), ncol(weather_df))
## [1] 2541 14
# Check the column names of city_bike_df and weather_df
colnames(city_bike_df)
## [1] "trip_duration" "s_time" "e_time" "s_station_id" "s_station_name" "s_lat"
## [7] "s_long" "e_station_id" "e_station_name" "e_lat" "e_long" "bike_id"
## [13] "user_type" "birth_year" "gender"
colnames(weather_df)
## [1] "STATION" "NAME" "DATE" "AWND" "PRCP" "SNOW" "SNWD" "TMAX" "TMIN" "WDF2" "WDF5"
## [12] "WSF2" "WSF5" "WT01"
# Display head of city_bike_df and weather_df
head(city_bike_df)
## trip_duration s_time e_time s_station_id s_station_name s_lat s_long
## 1 634 2013-07-01 00:00:00 2013-07-01 00:10:34 164 E 47 St & 2 Ave 40.75323 -73.97033
## 2 437 2013-07-01 06:54:02 2013-07-01 07:01:19 479 9 Ave & W 45 St 40.76019 -73.99126
## 3 1398 2013-07-01 08:03:38 2013-07-01 08:26:56 157 Henry St & Atlantic Ave 40.69089 -73.99612
## 4 1124 2013-07-01 08:37:40 2013-07-01 08:56:24 496 E 16 St & 5 Ave 40.73726 -73.99239
## 5 1199 2013-07-01 09:16:59 2013-07-01 09:36:58 432 E 7 St & Avenue A 40.72622 -73.98380
## 6 221 2013-07-01 11:50:21 2013-07-01 11:54:02 475 E 16 St & Irving Pl 40.73524 -73.98759
## e_station_id e_station_name e_lat e_long bike_id user_type birth_year gender
## 1 504 1 Ave & E 15 St 40.73222 -73.98166 16950 Customer NA 0
## 2 243 Fulton St & Rockwell Pl 40.68798 -73.97847 16151 Subscriber 1987 1
## 3 375 Mercer St & Bleecker St 40.72679 -73.99695 15997 Subscriber 1987 1
## 4 500 Broadway & W 51 St 40.76229 -73.98336 17750 Subscriber 1959 2
## 5 466 W 25 St & 6 Ave 40.74395 -73.99145 17671 Subscriber 1983 2
## 6 537 Lexington Ave & E 24 St 40.74026 -73.98409 16490 Subscriber 1956 1
head(weather_df)
## STATION NAME DATE AWND PRCP SNOW SNWD TMAX TMIN WDF2 WDF5 WSF2 WSF5 WT01
## 1 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 6.93 0 0 0 40 26 310 300 15.0 25.9 NA
## 2 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 5.82 0 0 0 33 22 310 340 15.0 21.9 NA
## 3 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 4.47 0 0 0 32 24 260 260 13.0 19.9 NA
## 4 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 8.05 0 0 0 37 30 290 250 17.9 28.0 NA
## 5 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 6.71 0 0 0 42 32 310 310 17.0 25.9 NA
## 6 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 6.71 0 0 0 46 34 290 270 13.0 19.9 NA
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| supplied_secs | 61.0000000 | 373.0000000 | 618.0000000 | 906.8727165 | 1061.0000000 | 1688083.0000 |
| calc_secs | 60.0000000 | 373.7270000 | 618.6000001 | 907.3520470 | 1061.9250000 | 1688083.0000 |
| calc_mins | 1.0000000 | 6.2287833 | 10.3100000 | 15.1225341 | 17.6987500 | 28134.7167 |
| calc_hours | 0.0166667 | 0.1038131 | 0.1718333 | 0.2520422 | 0.2949792 | 468.9119 |
| calc_days | 0.0006944 | 0.0043255 | 0.0071597 | 0.0105018 | 0.0122908 | 19.5380 |
The above indicates that the duration of the trips (in seconds) includes values in the millions – which likely reflects a trip which failed to be properly closed out.
Let’s assume that nobody would rent a bicycle for more than a specified time limit (say, 3 hours), and drop any records which exceed this:
## [1] "Removed 158 trips (0.171%) of longer than 3 hours."
## [1] "Remaining number of trips: 92407"
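A minimal sketch of this truncation step, assuming the duration is held in seconds in a `trip_duration` column as in the CitiBike files. The toy values and the variable names `max_secs`, `too_long`, and `kept` are illustrative only.

```r
# Toy trips; trip_duration is in seconds.
trips <- data.frame(trip_duration = c(634, 437, 1398, 10617, 1688083))

max_secs <- 3 * 60 * 60                       # the 3-hour limit from the text
too_long <- trips$trip_duration > max_secs    # flag trips over the limit
kept     <- trips[!too_long, , drop = FALSE]  # drop the flagged trips

# Report what was removed, as a share of all trips before filtering.
message(sprintf("Removed %d trips (%.3f%%) of longer than 3 hours.",
                sum(too_long), 100 * mean(too_long)))
message(sprintf("Remaining number of trips: %d", nrow(kept)))
```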
Other inconsistencies concern the collection of birth_year, from which we can infer the age of the participant. There are some months in which this value is omitted, while there are other months in which all values are populated. However, there are a few records which suggest that the rider is a centenarian – it seems highly implausible that someone born in the 1880s is cycling around Central Park – but the data does have such anomalies. Thus, a substantial amount of time was needed for detecting and cleaning such inconsistencies.
The earliest birth year in the data is 1885, which is not plausible:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1885 1969 1981 1978 1988 2003 5828
## [1] "Removed 41 trips (0.044%) of users older than 90 years."
## [1] "Removed 5828 trips (6.296%) of users where age is unknown (birth_year unspecified)."
## [1] "Remaining number of trips: 86538"
We compute the straight-line distance between the (longitude, latitude) points of the start and end stations; it doesn’t incorporate an actual bicycle route.
There are services (e.g., from Google) which can compute and measure a recommended bicycle route between points, but use of such services requires a subscription and incurs a cost.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5885 1.0477 1.3090 1.7280 10.6603
In this subset of the data, the maximum distance between stations is 10.66 km. In the full data there are some stations for which the latitude and longitude are zero, which would imply that the distance between such a station and an actual station is many thousands of miles. If such items exist, we will delete them:
## [1] "No unusually long distances were found in this subset of the data."
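One common way to obtain a straight-line distance from station coordinates is the haversine (great-circle) formula. The sketch below is an assumption about the method, since the text does not say which formula was used. The function name `haversine_km` and the Earth radius of 6371 km are illustrative choices.

```r
# Great-circle (haversine) distance in km between points given in degrees.
# Vectorised: each argument may be a vector of the same length.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# First trip shown earlier: E 47 St & 2 Ave to 1 Ave & E 15 St.
d <- haversine_km(40.75323, -73.97033, 40.73222, -73.98166)
```

A station recorded at latitude 0 and longitude 0 would produce a distance of several thousand km from any New York station, so a simple upper bound on `d` is enough to catch such records.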
There is a time-based usage fee for rides longer than an initial period:
#### Summary of trip durations AFTER censoring/truncation:
#express trip duration in seconds, minutes, hours, days
# note: we needed to fix the November daylight savings problem to eliminate negative trip times
#### Supplied seconds
#print("Supplied Seconds:")
supplied_secs<-summary(CB$trip_duration)
#### Seconds
CB$trip_duration_s = as.numeric(CB$e_time - CB$s_time,"secs")
calc_secs<-summary(CB$trip_duration_s)
#### Minutes
CB$trip_duration_m = as.numeric(CB$e_time - CB$s_time,"mins")
calc_mins<-summary(CB$trip_duration_m)
#### Hours
CB$trip_duration_h = as.numeric(CB$e_time - CB$s_time,"hours")
calc_hours<-summary(CB$trip_duration_h)
#### Days
CB$trip_duration_d = as.numeric(CB$e_time - CB$s_time,"days")
calc_days <-summary(CB$trip_duration_d)
# library(kableExtra) # loaded above
rbind(supplied_secs, calc_secs, calc_mins, calc_hours, calc_days) %>%
kable(caption = "Summary of trip durations - AFTER truncations:") %>%
kable_styling(c("bordered","striped"),latex_options = "hold_position")
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| supplied_secs | 61.0000000 | 363.0000000 | 592.0000000 | 775.7573898 | 994.0000000 | 10617.0000000 |
| calc_secs | 60.0000000 | 363.0000000 | 593.0000000 | 776.2454997 | 994.7922499 | 10617.5160000 |
| calc_mins | 1.0000000 | 6.0500000 | 9.8833333 | 12.9374250 | 16.5798708 | 176.9586000 |
| calc_hours | 0.0166667 | 0.1008333 | 0.1647222 | 0.2156237 | 0.2763312 | 2.9493100 |
| calc_days | 0.0006944 | 0.0042014 | 0.0068634 | 0.0089843 | 0.0115138 | 0.1228879 |
We could have chosen to censor the data, in which case we would not drop observations, but would instead move them to a limiting value, such as three hours (for trip time) or an age of 90 years (for adjusting birth_year).
As there were few such cases, we instead decided to truncate the data by dropping such observations from the dataset.
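The difference between the two approaches can be illustrated on a toy vector of trip durations. The values are made up for illustration; only the three-hour limit comes from the text.

```r
durations <- c(600, 1200, 9000, 20000)   # seconds; toy values
limit     <- 3 * 60 * 60                 # three hours = 10800 seconds

# Truncation: observations beyond the limit are dropped.
truncated <- durations[durations <= limit]

# Censoring: observations are kept but capped at the limit.
censored <- pmin(durations, limit)
```

Truncation shrinks the sample, while censoring keeps the sample size but piles the extreme observations up at the limiting value.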
Because there is so much data, it is difficult to analyze the entire universe of trip-by-trip data unless one has high-performance computational resources.
We can examine the correlations between variables to understand their relationships, and to stay alert to potential problems of multicollinearity. Here we compute both rank correlations (Spearman) and linear correlations (Pearson) between key variables on the individual CitiBike trip data. (Later we will compute correlations on daily aggregated data which has been joined with the daily weather observations.)
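As a minimal sketch, both kinds of correlation matrix can be obtained from base R's `cor()` by changing its `method` argument. The data frame below is toy data with made-up values, and the column names are chosen for illustration.

```r
# Toy trip-level data; values are invented for illustration.
trips <- data.frame(
  trip_duration = c(300, 420, 610, 900, 1500, 2400),
  distance_km   = c(0.6, 0.9, 1.1, 1.8, 2.5, 3.0),
  age           = c(25, 31, 38, 44, 52, 36)
)

pearson  <- cor(trips, method = "pearson")    # linear correlation
spearman <- cor(trips, method = "spearman")   # rank correlation

# Pairs of distinct variables with a high absolute correlation are
# candidates for multicollinearity (0.8 is an arbitrary threshold).
high <- which(abs(pearson) > 0.8 & upper.tri(pearson), arr.ind = TRUE)
```

In this toy data, duration and distance rise together in every row, so their Spearman correlation is exactly 1 even though the relationship is not perfectly linear.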
We will perform our calculations on an aggregated basis. We will group each day’s rides together, but we will segment by user_type (“Subscriber” or “Customer”) and by gender (“Male” or “Female”). There are some cases where the user_type is not specified, which we have designated as “Unknown.” For gender, there are cases where the CitiBike data set contains a zero, which indicates that the gender of the user was not recorded.
For each day, we will aggregate the following items across each of the above groupings:
We will split the aggregated data into a training dataset, consisting of all (grouped, daily) aggregations from 2013-2018, and a test dataset, consisting of (grouped, daily) aggregations from 2019.
We will then join each aggregated CitiBike data element with the corresponding weather observation for that date.
There are 5202 rows of daily aggregated data in the training dataset, and 1870 rows in the corresponding test dataset.
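The aggregate, split, and join steps described above can be sketched in base R as follows. This is an illustration on toy data rather than the authors' pipeline: the data frame names (`rides`, `daily`, `weather`), the `n_trips` counter column, and all values are assumptions.

```r
# Toy trip records (invented values).
rides <- data.frame(
  s_time = as.POSIXct(c("2018-07-01 08:00:00", "2018-07-01 09:30:00",
                        "2018-07-01 10:00:00", "2019-07-01 08:15:00"),
                      tz = "UTC"),
  user_type     = c("Subscriber", "Subscriber", "Customer", "Subscriber"),
  gender        = c("Male", "Male", "Female", "Female"),
  trip_duration = c(600, 900, 1800, 700)
)
rides$DATE    <- as.Date(rides$s_time)   # calendar day of the trip start
rides$n_trips <- 1                       # one row per trip, summed below

# Daily totals for each (DATE, user_type, gender) segment.
daily <- aggregate(x  = rides[c("n_trips", "trip_duration")],
                   by = rides[c("DATE", "user_type", "gender")],
                   FUN = sum)

# Split by year: 2013-2018 for training, 2019 for testing.
yr    <- as.integer(format(daily$DATE, "%Y"))
train <- daily[yr <= 2018, ]
test  <- daily[yr == 2019, ]

# Toy daily weather, keyed by DATE like the weather data frame.
weather <- data.frame(DATE = as.Date(c("2018-07-01", "2019-07-01")),
                      TMAX = c(88, 84), PRCP = c(0, 0.1))

# Left join: keep every aggregated row, attach that day's weather.
train <- merge(train, weather, by = "DATE", all.x = TRUE)
test  <- merge(test,  weather, by = "DATE", all.x = TRUE)
```

The join only works if `DATE` is populated in both tables; a weather table whose dates failed to parse would leave every weather column `NA` after the merge.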
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
## parsnip model object
##
## Fit time: 13ms
##
## Call:
## stats::lm(formula = formula, data = data)
##
## Coefficients:
## (Intercept) DATE AWND TMAX TMIN
## -7.7267141 0.0005903 0.0073968 -0.0029396 0.0010834
## WDF2 WDF5 WSF2 WSF5 user_typeSubscriber
## 0.0002639 0.0001031 -0.0229798 0.0029510 1.5414800
## user_typeUNKNOWN train genderFemale genderUNKNOWN sum_duration
## -0.0673986 NA -2.1778470 -2.7037704 0.0004949
## median_duration sum_distance_km avg_age
## -0.0014653 0.3946859 0.0099778
## Random Forest Model Specification (regression)
##
## Computational engine: ranger
##
## iter imp variable
## 1 1 AWND WDF2 WDF5 WSF2 WSF5
## 1 2 AWND WDF2 WDF5 WSF2 WSF5
## 1 3 AWND WDF2 WDF5 WSF2 WSF5
## 1 4 AWND WDF2 WDF5 WSF2 WSF5
## 1 5 AWND WDF2 WDF5 WSF2 WSF5
## 2 1 AWND WDF2 WDF5 WSF2 WSF5
## 2 2 AWND WDF2 WDF5 WSF2 WSF5
## 2 3 AWND WDF2 WDF5 WSF2 WSF5
## 2 4 AWND WDF2 WDF5 WSF2 WSF5
## 2 5 AWND WDF2 WDF5 WSF2 WSF5
## 3 1 AWND WDF2 WDF5 WSF2 WSF5
## 3 2 AWND WDF2 WDF5 WSF2 WSF5
## 3 3 AWND WDF2 WDF5 WSF2 WSF5
## 3 4 AWND WDF2 WDF5 WSF2 WSF5
## 3 5 AWND WDF2 WDF5 WSF2 WSF5
## 4 1 AWND WDF2 WDF5 WSF2 WSF5
## 4 2 AWND WDF2 WDF5 WSF2 WSF5
## 4 3 AWND WDF2 WDF5 WSF2 WSF5
## 4 4 AWND WDF2 WDF5 WSF2 WSF5
## 4 5 AWND WDF2 WDF5 WSF2 WSF5
## 5 1 AWND WDF2 WDF5 WSF2 WSF5
## 5 2 AWND WDF2 WDF5 WSF2 WSF5
## 5 3 AWND WDF2 WDF5 WSF2 WSF5
## 5 4 AWND WDF2 WDF5 WSF2 WSF5
## 5 5 AWND WDF2 WDF5 WSF2 WSF5
## 6 1 AWND WDF2 WDF5 WSF2 WSF5
## 6 2 AWND WDF2 WDF5 WSF2 WSF5
## 6 3 AWND WDF2 WDF5 WSF2 WSF5
## 6 4 AWND WDF2 WDF5 WSF2 WSF5
## 6 5 AWND WDF2 WDF5 WSF2 WSF5
## 7 1 AWND WDF2 WDF5 WSF2 WSF5
## 7 2 AWND WDF2 WDF5 WSF2 WSF5
## 7 3 AWND WDF2 WDF5 WSF2 WSF5
## 7 4 AWND WDF2 WDF5 WSF2 WSF5
## 7 5 AWND WDF2 WDF5 WSF2 WSF5
## 8 1 AWND WDF2 WDF5 WSF2 WSF5
## 8 2 AWND WDF2 WDF5 WSF2 WSF5
## 8 3 AWND WDF2 WDF5 WSF2 WSF5
## 8 4 AWND WDF2 WDF5 WSF2 WSF5
## 8 5 AWND WDF2 WDF5 WSF2 WSF5
## 9 1 AWND WDF2 WDF5 WSF2 WSF5
## 9 2 AWND WDF2 WDF5 WSF2 WSF5
## 9 3 AWND WDF2 WDF5 WSF2 WSF5
## 9 4 AWND WDF2 WDF5 WSF2 WSF5
## 9 5 AWND WDF2 WDF5 WSF2 WSF5
## 10 1 AWND WDF2 WDF5 WSF2 WSF5
## 10 2 AWND WDF2 WDF5 WSF2 WSF5
## 10 3 AWND WDF2 WDF5 WSF2 WSF5
## 10 4 AWND WDF2 WDF5 WSF2 WSF5
## 10 5 AWND WDF2 WDF5 WSF2 WSF5
## 11 1 AWND WDF2 WDF5 WSF2 WSF5
## 11 2 AWND WDF2 WDF5 WSF2 WSF5
## 11 3 AWND WDF2 WDF5 WSF2 WSF5
## 11 4 AWND WDF2 WDF5 WSF2 WSF5
## 11 5 AWND WDF2 WDF5 WSF2 WSF5
## 12 1 AWND WDF2 WDF5 WSF2 WSF5
## 12 2 AWND WDF2 WDF5 WSF2 WSF5
## 12 3 AWND WDF2 WDF5 WSF2 WSF5
## 12 4 AWND WDF2 WDF5 WSF2 WSF5
## 12 5 AWND WDF2 WDF5 WSF2 WSF5
## 13 1 AWND WDF2 WDF5 WSF2 WSF5
## 13 2 AWND WDF2 WDF5 WSF2 WSF5
## 13 3 AWND WDF2 WDF5 WSF2 WSF5
## 13 4 AWND WDF2 WDF5 WSF2 WSF5
## 13 5 AWND WDF2 WDF5 WSF2 WSF5
## 14 1 AWND WDF2 WDF5 WSF2 WSF5
## 14 2 AWND WDF2 WDF5 WSF2 WSF5
## 14 3 AWND WDF2 WDF5 WSF2 WSF5
## 14 4 AWND WDF2 WDF5 WSF2 WSF5
## 14 5 AWND WDF2 WDF5 WSF2 WSF5
## 15 1 AWND WDF2 WDF5 WSF2 WSF5
## 15 2 AWND WDF2 WDF5 WSF2 WSF5
## 15 3 AWND WDF2 WDF5 WSF2 WSF5
## 15 4 AWND WDF2 WDF5 WSF2 WSF5
## 15 5 AWND WDF2 WDF5 WSF2 WSF5
## 16 1 AWND WDF2 WDF5 WSF2 WSF5
## 16 2 AWND WDF2 WDF5 WSF2 WSF5
## 16 3 AWND WDF2 WDF5 WSF2 WSF5
## 16 4 AWND WDF2 WDF5 WSF2 WSF5
## 16 5 AWND WDF2 WDF5 WSF2 WSF5
## 17 1 AWND WDF2 WDF5 WSF2 WSF5
## 17 2 AWND WDF2 WDF5 WSF2 WSF5
## 17 3 AWND WDF2 WDF5 WSF2 WSF5
## 17 4 AWND WDF2 WDF5 WSF2 WSF5
## 17 5 AWND WDF2 WDF5 WSF2 WSF5
## 18 1 AWND WDF2 WDF5 WSF2 WSF5
## [... rows for iterations 18 through 50 omitted; every row imputes AWND WDF2 WDF5 WSF2 WSF5 ...]
## 50 5 AWND WDF2 WDF5 WSF2 WSF5
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## DATE AWND TMAX TMIN WDF2 WDF5 WSF2
## "" "pmm" "" "" "pmm" "pmm" "pmm"
## WSF5 user_type train gender sum_duration median_duration sum_distance_km
## "pmm" "" "" "" "" "" ""
## avg_age trips
## "" ""
## PredictorMatrix:
## DATE AWND TMAX TMIN WDF2 WDF5 WSF2 WSF5 user_type train gender sum_duration median_duration sum_distance_km
## DATE 0 1 1 1 1 1 1 1 1 0 1 1 1 1
## AWND 1 0 1 1 1 1 1 1 1 0 1 1 1 1
## TMAX 1 1 0 1 1 1 1 1 1 0 1 1 1 1
## TMIN 1 1 1 0 1 1 1 1 1 0 1 1 1 1
## WDF2 1 1 1 1 0 1 1 1 1 0 1 1 1 1
## WDF5 1 1 1 1 1 0 1 1 1 0 1 1 1 1
## avg_age trips
## DATE 1 1
## AWND 1 1
## TMAX 1 1
## TMIN 1 1
## WDF2 1 1
## WDF5 1 1
## Number of logged events: 1
## it im dep meth out
## 1 0 0 constant train
##
## iter imp variable
## 1 1 AWND WDF2 WDF5 WSF2 WSF5
## [... 248 rows omitted: iterations 1 through 50, imputations 1 through 5, every row imputes AWND WDF2 WDF5 WSF2 WSF5 ...]
## 50 5 AWND WDF2 WDF5 WSF2 WSF5
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## DATE AWND TMAX TMIN WDF2 WDF5 WSF2
## "" "pmm" "" "" "pmm" "pmm" "pmm"
## WSF5 user_type train gender sum_duration median_duration sum_distance_km
## "pmm" "" "" "" "" "" ""
## avg_age trips
## "" ""
## PredictorMatrix:
## DATE AWND TMAX TMIN WDF2 WDF5 WSF2 WSF5 user_type train gender sum_duration median_duration sum_distance_km
## DATE 0 1 1 1 1 1 1 1 1 0 1 1 1 1
## AWND 1 0 1 1 1 1 1 1 1 0 1 1 1 1
## TMAX 1 1 0 1 1 1 1 1 1 0 1 1 1 1
## TMIN 1 1 1 0 1 1 1 1 1 0 1 1 1 1
## WDF2 1 1 1 1 0 1 1 1 1 0 1 1 1 1
## WDF5 1 1 1 1 1 0 1 1 1 0 1 1 1 1
## avg_age trips
## DATE 1 1
## AWND 1 1
## TMAX 1 1
## TMIN 1 1
## WDF2 1 1
## WDF5 1 1
## Number of logged events: 1251
## it im dep meth out
## 1 0 0 constant train
## 2 1 1 AWND pmm user_typeUNKNOWN
## 3 1 1 WDF2 pmm user_typeUNKNOWN
## 4 1 1 WDF5 pmm user_typeUNKNOWN
## 5 1 1 WSF2 pmm user_typeUNKNOWN
## 6 1 1 WSF5 pmm user_typeUNKNOWN
## parsnip model object
##
## Fit time: 2.7s
## Ranger result
##
## Call:
## ranger::ranger(formula = formula, data = data, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 500
## Sample size: 5202
## Number of independent variables: 15
## Mtry: 3
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 2.942535
## R squared (OOB): 0.9796524
## # A tibble: 2 x 4
## model .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 lm rmse standard 2.27
## 2 rf rmse standard 0.778
## # A tibble: 2 x 4
## model .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 lm rmse standard 2.56
## 2 rf rmse standard 2.22
## # A tibble: 2 x 5
## .metric .estimator mean n std_err
## <chr> <chr> <dbl> <int> <dbl>
## 1 rmse standard 1.70 10 0.0122
## 2 rsq standard 0.980 10 0.000510
The RMSE of lm is `r rmse_train$.estimate[rmse_train$model=='lm']`, and that of random forest is `r rmse_train$.estimate[rmse_train$model=='rf']`. Random forest performs much better than lm on the training set. We then resample the training set to estimate how the random forest model will perform on new data. The resampled RMSE is `r rmse_rf$mean[rmse_rf$.metric=='rmse']`, which is still lower than the lm result.
The next model uses the tidymodels package and takes holidays into account during data preparation. It also runs a grid search over the penalty, evaluated on a validation dataset.
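As a rough illustration of the resampling idea described above, the following base R sketch runs 10-fold cross-validation for a linear model and summarises the per-fold RMSE as a mean and a standard error. The fold count of 10 is inferred from `n = 10` in the output. The built-in `mtcars` data, the formula `mpg ~ wt + hp`, and the helper `rmse()` are stand-ins chosen for illustration; they are not the report's data or code.

```r
# Hypothetical sketch: 10-fold cross-validated RMSE in base R.
# mtcars and the formula are illustrative stand-ins, not the CitiBike data.
set.seed(42)

# Root mean squared error between observed and predicted values.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

k <- 10
# Assign every row to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Fit on k - 1 folds, score on the held-out fold.
fold_rmse <- vapply(seq_len(k), function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  held_out <- mtcars[folds == i, ]
  rmse(held_out$mpg, predict(fit, newdata = held_out))
}, numeric(1))

# Mean and standard error across folds, analogous to the `mean` and
# `std_err` columns in the resampling output.
cv_summary <- c(mean = mean(fold_rmse), std_err = sd(fold_rmse) / sqrt(k))
print(cv_summary)
```

Because each fold is scored on rows the model did not see, the averaged RMSE is usually a more realistic estimate than the RMSE computed on the training rows themselves.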
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 x 2
## splits id
## <named list> <chr>
## 1 <split [4.1K/1K]> validation
## # A tibble: 5 x 1
## penalty
## <dbl>
## 1 0.0001
## 2 0.000127
## 3 0.000161
## 4 0.000204
## 5 0.000259
## # A tibble: 5 x 1
## penalty
## <dbl>
## 1 0.0386
## 2 0.0489
## 3 0.0621
## 4 0.0788
## 5 0.1
## penalty .metric .estimator mean n std_err
## 1 0.0001000000 rmse standard 2.226662 1 NA
## 2 0.0001268961 rmse standard 2.226662 1 NA
## 3 0.0001610262 rmse standard 2.226662 1 NA
## 4 0.0002043360 rmse standard 2.226662 1 NA
## 5 0.0002592944 rmse standard 2.226662 1 NA
## 6 0.0003290345 rmse standard 2.226662 1 NA
## 7 0.0004175319 rmse standard 2.226662 1 NA
## 8 0.0005298317 rmse standard 2.226662 1 NA
## 9 0.0006723358 rmse standard 2.226662 1 NA
## 10 0.0008531679 rmse standard 2.226662 1 NA
## 11 0.0010826367 rmse standard 2.226662 1 NA
## 12 0.0013738238 rmse standard 2.226662 1 NA
## 13 0.0017433288 rmse standard 2.226662 1 NA
## 14 0.0022122163 rmse standard 2.226662 1 NA
## 15 0.0028072162 rmse standard 2.226662 1 NA
## 16 0.0035622479 rmse standard 2.226662 1 NA
## 17 0.0045203537 rmse standard 2.226662 1 NA
## 18 0.0057361525 rmse standard 2.226662 1 NA
## 19 0.0072789538 rmse standard 2.226662 1 NA
## 20 0.0092367086 rmse standard 2.226662 1 NA
## 21 0.0117210230 rmse standard 2.226651 1 NA
## 22 0.0148735211 rmse standard 2.226492 1 NA
## 23 0.0188739182 rmse standard 2.226671 1 NA
## 24 0.0239502662 rmse standard 2.227317 1 NA
## 25 0.0303919538 rmse standard 2.228833 1 NA
## 26 0.0385662042 rmse standard 2.231191 1 NA
## 27 0.0489390092 rmse standard 2.235415 1 NA
## 28 0.0621016942 rmse standard 2.239631 1 NA
## 29 0.0788046282 rmse standard 2.246045 1 NA
## 30 0.1000000000 rmse standard 2.257211 1 NA
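The 30 penalty values in the table above look evenly spaced on a log10 scale between 0.0001 and 0.1. A minimal base R sketch that reproduces such a grid is shown below. The name `penalty_grid` is an assumption, and the report may have built its grid with a tidymodels helper instead.

```r
# Hypothetical sketch: 30 penalty candidates, evenly spaced on a log10 scale
# from 1e-4 to 1e-1.
penalty_grid <- 10^seq(-4, -1, length.out = 30)

# The first and last five values can be compared with the printed grid.
print(head(penalty_grid, 5))
print(tail(penalty_grid, 5))
```

In the table above, the lowest mean RMSE (2.226492) occurs in row 22, at a penalty of about 0.0149.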
## # A tibble: 1 x 6
## penalty .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0149 rmse standard 2.23 1 NA
## # A tibble: 1 x 6
## penalty .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0149 rmse standard 2.23 1 NA
## # A tibble: 1,031 x 6
## id .pred .row penalty trips model
## <chr> <dbl> <int> <dbl> <int> <chr>
## 1 validation 1.56 5 0.0149 2 glmnet
## 2 validation 4.96 14 0.0149 4 glmnet
## 3 validation 5.91 17 0.0149 7 glmnet
## 4 validation 7.07 24 0.0149 5 glmnet
## 5 validation 7.72 32 0.0149 6 glmnet
## 6 validation 19.8 33 0.0149 18 glmnet
## 7 validation 15.0 35 0.0149 18 glmnet
## 8 validation 26.9 37 0.0149 27 glmnet
## 9 validation 7.35 42 0.0149 8 glmnet
## 10 validation 10.3 43 0.0149 9 glmnet
## # … with 1,021 more rows
The glmnet model is trained with a grid search over the penalty. The penalty with the lowest RMSE is selected, giving a mean RMSE of `r lr_rmse$mean`.
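To make the RMSE metric concrete, the sketch below computes it by hand for the ten validation rows printed above, where `.pred` is the glmnet prediction and `trips` is the observed count. The variable names are assumptions. Only 10 of the 1,031 validation rows are used, so the result will not match the full validation RMSE.

```r
# Hypothetical sketch: RMSE computed by hand on the ten printed validation rows.
pred  <- c(1.56, 4.96, 5.91, 7.07, 7.72, 19.8, 15.0, 26.9, 7.35, 10.3)
trips <- c(2, 4, 7, 5, 6, 18, 18, 27, 8, 9)

# Square the errors, average them, then take the square root.
rmse_first_ten <- sqrt(mean((trips - pred)^2))
print(rmse_first_ten)
```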
## [1] 12
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
##
## Engine-Specific Arguments:
## num.threads = cores
##
## Computational engine: ranger
## Collection of 2 parameters for tuning
##
## id parameter type object class
## mtry mtry nparam[?]
## min_n min_n nparam[+]
##
## Model parameters needing finalization:
## # Randomly Selected Predictors ('mtry')
##
## See `?dials::finalize` or `?dials::update.parameters` for more information.
## # A tibble: 1 x 7
## mtry min_n .metric .estimator mean n std_err
## <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 14 8 rmse standard 1.58 1 NA
## # A tibble: 1 x 2
## mtry min_n
## <int> <int>
## 1 14 8
## # A tibble: 25,775 x 6
## id .pred .row mtry min_n trips
## <chr> <dbl> <int> <int> <int> <int>
## 1 validation 2.22 5 13 30 2
## 2 validation 4.77 14 13 30 4
## 3 validation 6.74 17 13 30 7
## 4 validation 6.54 24 13 30 5
## 5 validation 7.28 32 13 30 6
## 6 validation 19.2 33 13 30 18
## 7 validation 15.7 35 13 30 18
## 8 validation 27.1 37 13 30 27
## 9 validation 8.26 42 13 30 8
## 10 validation 9.68 43 13 30 9
## # … with 25,765 more rows
The random forest model has a better RMSE than the glmnet model: `r top_rf$mean` versus `r top_glmnet$mean`.
The random forest model is then built with the best grid search parameters and tested on the hold-out test dataset. The important variables are listed.
## # Monte Carlo cross-validation (0.75/0.25) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [5.2K/1.7K]> train/test split <tibble [2 × 3]> <tibble [0 × 1]> <tibble [1,721 × 3]> <workflow>
The test result shows that the training and testing RMSEs are very close: `r top_rf$mean` and `r last_rmse$.estimate[last_rmse$.metric=='rmse']`. This suggests that the model predicts well on unseen data.